Murder Accountability Project: Homicide

In [1]:
from IPython.display import HTML

HTML('''<script>
code_show=true; 
function code_toggle() {
 if (code_show){
 $('div.input').hide();
 } else {
 $('div.input').show();
 }
 code_show = !code_show
} 
$( document ).ready(code_toggle);
</script>
<form action="javascript:code_toggle()"><input type="submit" value="Click here to toggle on/off the raw code."></form>''')
Out[1]:

1. Business Understanding


1.1 Purpose

Homicide Data

The Murder Accountability Project is a nonprofit organization that uncovers discrepancies between homicides reported by medical examiners and the FBI's voluntary crime reports. The database is considered one of the most exhaustive collections of homicide records currently available in the United States. Additional information about the organization can be found at Murder Accountability Project.

This dataset was collected to investigate the failure of law enforcement agencies to report homicides. It covers 1980 to 2014 and includes demographic information such as gender, age, race, and ethnicity for both victims and perpetrators. A more in-depth description of the attributes may be found in the Data Description section.

Documentaries, cold-case files, and similar coverage place the many homicide cases that go unsolved under a spotlight. According to the Murder Accountability Project, an estimated ~5,000 murderers get away with murder each year, with the unsolved share rising to nearly one-third of reported homicides. While hundreds of thousands of Americans have been murdered, many cases go unaccounted for due to a lack of documentation for unsolved homicides. The dataset, and the organization's focus, is to educate the public on the significance of unsolved cases.

1.2 Data Significance

  1. Predicting whether a crime will be solved or go unsolved
    We have chosen to predict whether a homicide will be solved, using the variables recorded during the initial filing of the homicide report.

  2. Predict perpetrators of homicides
    We have also chosen to predict the demographics of a perpetrator (age, race, sex, and ethnicity) using the variables recorded during the initial filing of the homicide report. Predicting a perpetrator's demographics may generate leads that help solve a case. Sensitivity toward police bias, the ethics of profiling, and those falsely incarcerated is a high priority: while predicting who the perpetrator may be can increase the rate of solved cases, careful consideration will be given to avoid wrongfully accusing a suspect.

1.3 Algorithm Effectiveness

Both response variables are categorical. K-fold cross-validation (CV) will be used to evaluate the effectiveness of the classification algorithms, measured using the ROC curve, accuracy, and sensitivity & specificity.

  1. K-fold cross-validation (CV)
    Cross-validation keeps the training and testing datasets separate, preventing the training data's accuracy from overstating the model's true predictive capability. Performing CV ensures the resulting accuracies are not skewed by overfitting.
  2. ROC Curve
    The Receiver Operating Characteristic curve measures prediction quality across classification thresholds. This allows the investigator to reduce dimensionality and complexity while maintaining a high-quality model.
  3. Accuracy
    Accuracy measures the model's overall predictive capability.
  4. Sensitivity and Specificity
    While accuracy measures the model's overall predictive capability, predicting each class accurately should also be emphasized. Consider a skewed dataset with 80% class A and 20% class B: a model that classifies every record as class A achieves 80% accuracy, which initially appears promising, yet no record is ever correctly predicted as class B. Sensitivity and specificity ensure that accuracy holds across the class distribution.
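The evaluation plan above can be sketched end-to-end. The snippet below is a minimal illustration on a synthetic, imbalanced toy dataset (80%/20%, mirroring the class-A/class-B example above), not the homicide data; the logistic regression model and all parameters are placeholders.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, roc_auc_score
from sklearn.model_selection import StratifiedKFold

# synthetic, imbalanced stand-in for solved/unsolved labels (80% / 20%)
X, y = make_classification(n_samples=1000, weights=[0.8, 0.2], random_state=0)

# stratified k-fold CV keeps train/test separate and preserves class balance
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for train_idx, test_idx in skf.split(X, y):
    model = LogisticRegression(max_iter=1000).fit(X[train_idx], y[train_idx])
    pred = model.predict(X[test_idx])
    prob = model.predict_proba(X[test_idx])[:, 1]

    # sensitivity/specificity expose the per-class performance that
    # overall accuracy hides on a skewed dataset
    tn, fp, fn, tp = confusion_matrix(y[test_idx], pred).ravel()
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    print(f'acc={accuracy_score(y[test_idx], pred):.3f}  '
          f'auc={roc_auc_score(y[test_idx], prob):.3f}  '
          f'sens={sensitivity:.3f}  spec={specificity:.3f}')
```

A classifier that predicted the majority class everywhere would score roughly 0.80 accuracy here but zero sensitivity, which is exactly the failure mode sensitivity and specificity are meant to catch.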

The output below displays the initial data.

In [2]:
# IMPORT LIBRARIES
# hide warnings
import warnings
warnings.filterwarnings('ignore')

# all imported libraries used for analysis
import numpy as np
import pandas as pd 
import os 
import urllib
import copy
import plotly 
import matplotlib.pyplot as plt
%matplotlib inline 
import seaborn as sns 
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import plotly.express as px
import statsmodels.api as sm
import random
import us

from geopy.geocoders import Nominatim
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.decomposition import PCA
from sklearn.utils import resample
from sklearn.feature_selection import RFE
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OrdinalEncoder
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2
from datetime import datetime
from sklearn.metrics import roc_auc_score
from sklearn.metrics import roc_curve
from sklearn.metrics import classification_report
from sklearn.metrics import plot_confusion_matrix
from sklearn.metrics import confusion_matrix
from pandas.plotting import scatter_matrix

# set color scheme and style for seaborn
sns.set(color_codes=True)
sns.set_style('whitegrid')
In [3]:
# Read the database.csv file and store in a dataframe
df=pd.read_csv('../Data/database.csv')
In [4]:
# take a peek at the dataframe to validate the dataframe is populated
df.head()
Out[4]:
Record ID Agency Code Agency Name Agency Type City State Year Month Incident Crime Type ... Victim Ethnicity Perpetrator Sex Perpetrator Age Perpetrator Race Perpetrator Ethnicity Relationship Weapon Victim Count Perpetrator Count Record Source
0 1 AK00101 Anchorage Municipal Police Anchorage Alaska 1980 January 1 Murder or Manslaughter ... Unknown Male 15 Native American/Alaska Native Unknown Acquaintance Blunt Object 0 0 FBI
1 2 AK00101 Anchorage Municipal Police Anchorage Alaska 1980 March 1 Murder or Manslaughter ... Unknown Male 42 White Unknown Acquaintance Strangulation 0 0 FBI
2 3 AK00101 Anchorage Municipal Police Anchorage Alaska 1980 March 2 Murder or Manslaughter ... Unknown Unknown 0 Unknown Unknown Unknown Unknown 0 0 FBI
3 4 AK00101 Anchorage Municipal Police Anchorage Alaska 1980 April 1 Murder or Manslaughter ... Unknown Male 42 White Unknown Acquaintance Strangulation 0 0 FBI
4 5 AK00101 Anchorage Municipal Police Anchorage Alaska 1980 April 2 Murder or Manslaughter ... Unknown Unknown 0 Unknown Unknown Unknown Unknown 0 1 FBI

5 rows × 24 columns

2. Data Meaning Type


The total number of records and attributes in the originating dataset.

In [5]:
# print the number of records and attributes in the dataframe
records = len(df)
attributes = df.columns
print(f'Total Number of Records: {records} \nTotal Number of Attributes: {len(attributes)}')
Total Number of Records: 638454 
Total Number of Attributes: 24


The table below displays the attribute name, description, and data type. In terms of unique classes, Agency Name and Agency Code both represent the same data. During the data cleaning phase, specific columns are removed due to lack of significance or because they restate information.

While reviewing the attributes, an interesting note is that Victim Count and Perpetrator Count record the additional persons involved in the crime. Each victim has their own record (in addition to being counted in the column), so the number of records equals the number of victims. Perpetrators, in contrast, do not have separate records, so the total number of perpetrators is the number of unique cases plus the sum of Perpetrator Count.

In [6]:
# summary of the variables, counts, null state, and data type
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 638454 entries, 0 to 638453
Data columns (total 24 columns):
 #   Column                 Non-Null Count   Dtype 
---  ------                 --------------   ----- 
 0   Record ID              638454 non-null  int64 
 1   Agency Code            638454 non-null  object
 2   Agency Name            638454 non-null  object
 3   Agency Type            638454 non-null  object
 4   City                   638454 non-null  object
 5   State                  638454 non-null  object
 6   Year                   638454 non-null  int64 
 7   Month                  638454 non-null  object
 8   Incident               638454 non-null  int64 
 9   Crime Type             638454 non-null  object
 10  Crime Solved           638454 non-null  object
 11  Victim Sex             638454 non-null  object
 12  Victim Age             638454 non-null  int64 
 13  Victim Race            638454 non-null  object
 14  Victim Ethnicity       638454 non-null  object
 15  Perpetrator Sex        638454 non-null  object
 16  Perpetrator Age        638454 non-null  object
 17  Perpetrator Race       638454 non-null  object
 18  Perpetrator Ethnicity  638454 non-null  object
 19  Relationship           638454 non-null  object
 20  Weapon                 638454 non-null  object
 21  Victim Count           638454 non-null  int64 
 22  Perpetrator Count      638454 non-null  int64 
 23  Record Source          638454 non-null  object
dtypes: int64(6), object(18)
memory usage: 116.9+ MB
In [7]:
df_description = pd.read_excel('../Data/data_description.xlsx')
pd.set_option('display.max_colwidth', 0)
df_types = pd.DataFrame(df.dtypes, columns=['Data Type']).reset_index().rename(columns={'index': 'Attributes'})
df_description = pd.merge(df_description, df_types, on='Attributes', how='inner')
df_description
Out[7]:
Attributes Description Data Type
0 Record ID Unique record identifier generated by MAP based upon the YEAR, MONTH, INCIDENT and ORI code in the report. int64
1 Agency Code Unique Agency identifier. object
2 Agency Name The alphanumeric variable describing the name of the law enforcement agency making the report. object
3 Agency Type The type of law enforcement agency making the report. (e.g. Sheriff) object
4 City Alphanumeric variable describing the original FBI naming and abbreviating scheme for the city of the reporting agency. object
5 State Alphanumeric variable describing the original FBI naming and abbreviating scheme for the state of the reporting agency. object
6 Year Year of homicide (or when victim’s body was recovered.) Numeric four digit. (F4.0). int64
7 Month The month of homicide occurrence or when the victim’s body was recovered. object
8 Incident A three-digit number describing the case number within the month in which a homicide occurred. int64
9 Crime Type An alphanumeric variable defining whether the report was "Murder or Nonnegligent manslaughter” or “Manslaughter by Negligence.” object
10 Crime Solved MAP-generated indicator whether Offender was identified at time report was made (SOLVED=1) or not identified (SOLVED=0). object
11 Victim Sex Variable representing whether the victim was ”Male”, ”Female” or “Unknown” gender, usually for conditions in which incomplete remains were recovered. object
12 Victim Age Variable describing the age in years of the victim. A value of 0 indicates the victim had not achieved a full year of life, 99 represents all victims 99 or older, and 998 represents victims whose age was not reported, usually because the victim was unidentified and the age was unknown. int64
13 Victim Race Variable representing the victim's race. object
14 Victim Ethnicity Variable representing whether the victim was “Hispanic Origin”, “Not of Hispanic Origin” or “Unknown or Not Reported.” Many agencies declined reporting the ethnicity of victims and offenders. object
15 Perpetrator Sex Variable representing whether the offender was “Male”, “Female” or “Unknown” gender, usually in conditions in which the offender had not been identified at the time of the report. object
16 Perpetrator Age Variable describing the age in years of the offender. When the offender was not identified at the time of the report, age was reported as 999. A value of 99 represents all offenders 99 or older. object
17 Perpetrator Race Variable representing the offender's race; “Unknown” usually in conditions in which the offender had not been identified at the time of the report. object
18 Perpetrator Ethnicity Variable representing whether the offender was “Hispanic Origin”, “Not of Hispanic Origin” or “Unknown or Not Reported.” Many agencies declined reporting the ethnicity of victims and offenders. object
19 Relationship Variable describing the relationship between the victim and the offender, if any. object
20 Weapon Variable representing the weapon used in the crime. object
21 Victim Count The number of additional victims (not counting the victim included in the current record) included in the Supplementary Homicide Report’s incident record, which can accept up to 10 additional victims in a single incident report. (F3.0) Incidents of mass murder of more than 11 victims would require multiple SHR incident reports. The Murder Accountability Project’s database captures all reported homicide victims as separate cases. Associated victims will have identical ID numbers int64
22 Perpetrator Count The number of additional offenders (not counting the offender included in the current record) included in the Supplementary Homicide Report’s incident record, which can accept up to 10 additional offenders in a single incident report. (F3.0) Unlike victims, the Murder Accountability Project does not create multiple case reports for additional offenders. Only the first offender listed in the original SHR report is included. int64
23 Record Source MAP-generated identifier if record provided by FBI or was obtained by MAP under the Freedom of Information Act from an Agency not participating in SHR reporting to the FBI. object

To retrieve the total number of perpetrators, it is assumed that the incident number within a month uniquely identifies each case; Incident is not a running total of incidents for the record. Each case has at least one perpetrator, so after grouping by incident, month, and perpetrator count, the number of unique records plus the sum of the perpetrator counts gives the total number of perpetrators.

In [8]:
perpetrators_df = df.groupby(['Incident', 'Month', 'Perpetrator Count']).count().reset_index()[['Incident', 'Month', 'Perpetrator Count']]
single_perpetrators = len(perpetrators_df)
additional_perpetrators = perpetrators_df['Perpetrator Count'].sum()
# additional_perpetrators = df['Perpetrator Count'].sum()
total_perpetrators = single_perpetrators + additional_perpetrators
print(f'Total number of perpetrators: {total_perpetrators}')
print(f'Total number of victims: {len(df)}')
Total number of perpetrators: 27451
Total number of victims: 638454

3. Data Quality


3.1 Missing Data

All missing data is categorized as 'Unknown', or as 998 for a victim's or perpetrator's age. As noted in the data meaning above, missing data generally indicates that the perpetrator or victim was unknown at the time the homicide was reported.

Indicated in the output below, there are no columns with null values.

In [9]:
# check for nulls and put into a dataframe
df_null = pd.DataFrame(df.isnull().sum(), columns=['null_count'])

# filter on null counts that are not 0
df_null.loc[df_null['null_count'] != 0]
Out[9]:
null_count

While no nulls are present, the data contains ' ' (a blank string) as a value, which is replaced with 0 to align with the unknown values. Since only one record matches this criterion, it is concluded to be a data entry error. 0 is used rather than 998 because 998 is otherwise unused in this column, while 0 already indicates missing data.

In [10]:
blank_index = df.loc[df['Perpetrator Age']== ' '].index.values[0]
df.at[blank_index, 'Perpetrator Age'] = '0'
df.loc[df['Perpetrator Age']== 998]
Out[10]:
Record ID Agency Code Agency Name Agency Type City State Year Month Incident Crime Type ... Victim Ethnicity Perpetrator Sex Perpetrator Age Perpetrator Race Perpetrator Ethnicity Relationship Weapon Victim Count Perpetrator Count Record Source

0 rows × 24 columns

Another anomaly discovered is a perpetrator age of 0, which would mean a child under the age of one committed the homicide; this has very little credibility. When filtering for records matching this scenario, the majority also list the rest of the perpetrator's demographics as unknown, indicating that the age is likely unknown rather than truly below one year. Note that although the age is zero, 601 of these records are solved and 4,648 are not; the difference can be attributed to this data not being collected or filed into the record.

In [11]:
crime_age_0 = pd.DataFrame(df.loc[df['Perpetrator Age']== '0'].groupby('Crime Solved').count()['Record ID'].rename({'Record ID': 'Count'}))
crime_age_0
Out[11]:
Record ID
Crime Solved
No 4648
Yes 601

3.2 Duplicate Data

Although duplicates are not present in the dataset (table below), it should be noted that a perpetrator with more than one victim produces a unique record per victim. This means the count of records alone gives the accurate number of victims, rather than the number of observations plus the sum of additional victims.

In [12]:
def checkDuplicate(df):
    # count occurrences of each complete row; any size > 1 is a duplicate
    df_duplicates = df.groupby(df.columns.tolist(), as_index=False).size()
    duplicates = len(df_duplicates.loc[df_duplicates['size'] > 1])
    if duplicates == 0:
        print('NO, duplicate instances are not present')
    else:
        print('Yes, duplicate instances are present')

# check if duplicate instances are present
checkDuplicate(df)
NO, duplicate instances are not present

3.3 Outlier Data


The boxplots below display the continuous attributes to identify outliers.

  1. Year
    There are no outliers for Year; reviewing the plot below, no points fall beyond the quartiles.
  2. Victim/Perpetrator Age
    While ages above ~70 for victims and ~60 for perpetrators are flagged as outliers, these values are realistic and have not been removed. Such cases are undoubtedly rarer, but removing them would produce a more generic dataset that leaves out a number of victims and murderers. Note that for regression or classification, these records will likely need to be removed for the models to meet their assumptions.
  3. Victim/Perpetrator Count
    The boxplot shows that any count above 0, whether victim or perpetrator, is flagged as an outlier, since the majority of cases involve only a single victim-perpetrator pair. These 'outliers' are not removed because the additional perpetrators/victims provide valuable data for the larger picture.
In [13]:
# create list of continuous variables
df['Perpetrator Age'] = df['Perpetrator Age'].astype(int)

continuous_att = np.delete(df.describe().columns.values, 0, 0) # remove record id
continuous_att = np.delete(continuous_att, 1, 0) # remove incident

# replace 0 with NaN 
df_continuous = df[list(continuous_att)]
df_continuous['Victim Age'] = df_continuous['Victim Age'].replace(998, np.NaN)
df_continuous['Perpetrator Age'] = df_continuous['Perpetrator Age'].replace(0, np.NaN)

# function to create boxplots
def create_boxplots(df, continuous):
    fig = make_subplots(rows=1, cols=len(continuous))
    for i in range(len(continuous)):
        fig.add_trace(go.Box(y=df[continuous[i]], name = continuous_att[i]), row = 1, col=i+1)
    fig.update_layout(title='Boxplot Outlier Detection')
    fig.show()

# call function
create_boxplots(df_continuous,continuous_att)

4. New Feature Creation


Due to the vast number of categories per variable, the attributes were further bucketed into more general groups to reduce complexity.

  1. Bucketing the age groups
  2. Combining race and ethnicity
  3. Combining city and state
  4. Combining month and year
  5. Binning relationships
  1. Age Groups
    While age is treated as a continuous variable, it can be binned into age groups of 10 years (e.g. 0-10, 11-20, etc.). This allows for more generalization when analyzing which attributes predict the perpetrator's demographics, since narrowing age down to a single value may prove difficult for investigators in pursuit.
In [14]:
# binning age
age_bins = np.array([0,10,20,30,40,50,60,70,80,90,100,998])
age_labels = ['0-10', '11-20','21-30','31-40','41-50','51-60','61-70','71-80','81-90', '91-100', '998']
df["Victim_Age_Group"] = pd.cut(df['Victim Age'].astype(int), age_bins, labels=age_labels, include_lowest=True)
df['Perpetrator Age']=df['Perpetrator Age'].replace(to_replace=" ",value=0)
df['Perpetrator Age'] = df['Perpetrator Age'].astype(int)
df['Perpetrator_Age_Group'] = pd.cut(df['Perpetrator Age'].astype(int), age_bins, labels=age_labels, include_lowest=True)
  1. Race/Ethnicity
    Ethnicity and race are combined so that race/ethnicity can be treated as a single unit, creating unique combinations. This allows race/ethnicity to be compared against other variables to determine relationships.
In [15]:
# combine Victim and Perpetrator Race & Ethnicity into new features - Victim_Race_Ethnicity and Perpetrator_Race_Ethnicity
df['Victim_Race_Ethnicity'] = df['Victim Race'] + ', ' + df['Victim Ethnicity']
df['Victim_Race_Ethnicity'] = df['Victim_Race_Ethnicity'].str.replace(', Unknown','')
df['Perpetrator_Race_Ethnicity'] = df['Perpetrator Race'] + ', ' + df['Perpetrator Ethnicity']
df['Perpetrator_Race_Ethnicity'] = df['Perpetrator_Race_Ethnicity'].str.replace(', Unknown', '')
  1. City/State
    City and state are combined because some city names exist in multiple states (visible in the table below of cities grouped by state). Combining the two creates a unique city-state pair, allowing for more accurate analysis.
In [16]:
# combine City and State into a new feature - City_State
df['City_State'] = df['City'] + ', ' + df['State']
df_city_count = df.groupby(['City', 'State']).count()['Record ID'].reset_index().groupby('City').count()
print(df_city_count['State'].loc[df_city_count['State'] > 1].head(3))
City
Adair     4 
Adams     12
Albany    2 
Name: State, dtype: int64
  1. Month/Year
    Combining month and year produces a more detailed timeline, allowing a time series to be run to determine whether there is a lag and to predict the rates of solved vs. unsolved cases in future years.
In [17]:
# combine Month and Year into a new feature - Month_Year
df['Month_Year'] = pd.to_datetime(df['Year'].astype(str)  + df['Month'], format='%Y%B')
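As a minimal sketch of that use, the Month_Year feature can be aggregated into a monthly solved-rate series suitable for time-series analysis. The frame below is a small hypothetical stand-in for the full dataset, with Crime Solved assumed to be already recoded to 1/0.

```python
import pandas as pd

# hypothetical stand-in rows; the real dataframe has the same columns
toy = pd.DataFrame({
    'Year':  [1980, 1980, 1980, 1981],
    'Month': ['January', 'January', 'March', 'January'],
    'Crime Solved': [1, 0, 1, 1],   # assumed already recoded Yes=1 / No=0
})
toy['Month_Year'] = pd.to_datetime(toy['Year'].astype(str) + toy['Month'], format='%Y%B')

# fraction of cases solved per month -- the series a time-series model consumes
monthly_solved_rate = toy.groupby('Month_Year')['Crime Solved'].mean()
print(monthly_solved_rate)
```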
  1. Relationship
    Relationship is broken into more generalized groups for exploration of case 2, predicting the perpetrator. Taking a higher-level view of the relationship allows investigators to cast a wider net when searching for potential suspects.
In [18]:
relationship_dict = {
    'Female Partner': ['Wife', 'Girlfriend', 'Ex-Wife', 'Common-Law Wife'],
    'Male Partner': ['Ex-Husband', 'Husband','Boyfriend', 'Common-Law Husband'],
    'Parent': ['Father','In-Law','Mother','Stepfather','Stepmother'],
    'Children': ['Daughter', 'Son', 'Stepdaughter','Stepson'],
    'Sibling': ['Brother', 'Sister'],
    'Work': ['Employee', 'Employer']
}
df['Relationship_Group'] = df['Relationship']
# invert relationship_dict so each specific relationship maps to its group;
# values not found in the dictionary are kept as-is
rel_lookup = {rel: group for group, rels in relationship_dict.items() for rel in rels}
df['Relationship_Group'] = df['Relationship_Group'].replace(rel_lookup)

As displayed in the table below, the new attributes are now binned/combined accordingly and available for analysis.

In [19]:
df[['Victim_Age_Group','Perpetrator_Age_Group', 'Victim_Race_Ethnicity','Perpetrator_Race_Ethnicity', 
    'City_State', 'Month_Year','Relationship_Group']].head()
Out[19]:
Victim_Age_Group Perpetrator_Age_Group Victim_Race_Ethnicity Perpetrator_Race_Ethnicity City_State Month_Year Relationship_Group
0 11-20 11-20 Native American/Alaska Native Native American/Alaska Native Anchorage, Alaska 1980-01-01 Acquaintance
1 41-50 41-50 White White Anchorage, Alaska 1980-03-01 Acquaintance
2 21-30 0-10 Native American/Alaska Native Unknown Anchorage, Alaska 1980-03-01 Unknown
3 41-50 41-50 White White Anchorage, Alaska 1980-04-01 Acquaintance
4 21-30 0-10 Native American/Alaska Native Unknown Anchorage, Alaska 1980-04-01 Unknown

4.1 Data Wrangling

In [20]:
# data wrangling, clean-up, rename headers, drop columns, change data types, and transforms
# change crime solved values - Yes = 1 and No = 0 
df['Crime Solved']=df['Crime Solved'].replace(to_replace='No',value=0)
df['Crime Solved']=df['Crime Solved'].replace(to_replace='Yes',value=1)

4.2 Dummy Code


The categorical columns are dummy coded for additional analyses that cannot handle non-numerical data. While the dummy-code method is used in this example, future data will be one-hot encoded in response to sparse data. Although the ordinal method is used for the logistic regression in the exceptional section, it is not recommended, since ordinal encoding implies an order among the coded categories. The new columns are removed to run the logistic regression on the original dataset.

In [21]:
# Function to create dummy variables
def dummy_code(col, df): # input the column names and dataframe
    df_dummy = pd.DataFrame()
    for val in col:
        df_dummy_temp = pd.get_dummies(df[val], prefix=val)
        df_dummy = pd.concat([df_dummy, df_dummy_temp], axis=1, sort=False)
    return df_dummy
In [22]:
# select columns for dummy coding
df1 = df
df1['Perpetrator Age'] = df1['Perpetrator Age'].astype('object')
df_categorical = df.drop(['Victim_Age_Group','Perpetrator_Age_Group', 'Victim_Race_Ethnicity','Perpetrator_Race_Ethnicity', 
    'City_State', 'Month_Year','Relationship_Group'], axis=1).select_dtypes(include='object')
cat_col = df_categorical.columns.values
categorical = np.delete(cat_col, [0,1])
In [23]:
# call function for dummy coding variables
df_dummy = dummy_code(categorical, df)
In [24]:
continuous_col = list(df.describe().columns.values)
df_continuous = df[continuous_col]
df_dummy = df_dummy.merge(df_continuous, on=None, left_index=True, right_index=True, how='outer')
df_dummy = df_dummy.merge(df[['Agency Name', 'Agency Code']], on=None, left_index=True, right_index=True, how='outer')

5. Simple Statistics


5.1 Continuous Attributes

Victim Age
Victim ages range from 0 to 99 with an average of about 33.6. It is unfortunate to note that victims can be less than one year old. The 998 ('Unknown') values artificially inflate the mean victim age, so they are replaced with NaN and the summary rerun (as displayed below).

Perpetrator Age
Perpetrator ages span a similar range to victim ages once the 998 ('Unknown') values are removed. It is worth noting that the mean victim age is more than a decade higher than the mean perpetrator age; in addition, nearly all victim quartiles are higher than the perpetrator's.

Victim and Perpetrator Counts
The features Victim Count and Perpetrator Count represent the number of additional victims or perpetrators, if any. In the majority of cases there were no additional victims or perpetrators (the minimum value is 0), but in some incidents up to 10 additional victims and/or perpetrators were present.

In [25]:
# count, mean, standard deviation, minimum and maximum values and the quantities for continuous variables
df2 = df
df2['Perpetrator Age'] = df2['Perpetrator Age'].astype(int)
df2 = df2.replace(998, np.nan)
df2.describe().T
Out[25]:
count mean std min 25% 50% 75% max
Record ID 638453.0 319227.998438 184305.652746 1.0 159615.0 319228.0 478841.0 638454.0
Year 638454.0 1995.801102 9.927693 1980.0 1987.0 1995.0 2004.0 2014.0
Incident 638438.0 22.943489 92.021608 0.0 1.0 2.0 10.0 999.0
Crime Solved 638454.0 0.701964 0.457396 0.0 0.0 1.0 1.0 1.0
Victim Age 637480.0 33.562204 17.792594 0.0 22.0 30.0 42.0 99.0
Perpetrator Age 638454.0 20.322665 17.886846 0.0 0.0 21.0 31.0 99.0
Victim Count 638454.0 0.123334 0.537733 0.0 0.0 0.0 0.0 10.0
Perpetrator Count 638454.0 0.185224 0.585496 0.0 0.0 0.0 0.0 10.0

Correlation Matrix
The correlation matrix below reveals the correlation between the continuous variables, supporting the pairs matrix discussed later in the report.

In [26]:
# Correlation plot
plt.figure(figsize=(8,4))
sns.heatmap(df.corr())
plt.title('Correlation Between Continuous Variables')
Out[26]:
Text(0.5, 1, 'Correlation Between Continuous Variables')

5.2 Categorical Attributes

Most of the crimes committed were murder or manslaughter, indicating that the majority of homicides are intentional rather than negligent. In addition, the majority were committed in Los Angeles, California, with white males being the most likely victims. These figures should be contrasted with population and demographics for a well-rounded picture: Los Angeles has a high population, and the share of white males in America's population gives context to the victim percentages.

In [27]:
# basic statistics for categorical features
df_categorical.describe()
Out[27]:
Agency Code Agency Name Agency Type City State Month Crime Type Victim Sex Victim Race Victim Ethnicity Perpetrator Sex Perpetrator Age Perpetrator Race Perpetrator Ethnicity Relationship Weapon Record Source
count 638454 638454 638454 638454 638454 638454 638454 638454 638454 638454 638454 638454 638454 638454 638454 638454 638454
unique 12003 9216 7 1782 51 12 2 3 5 3 3 100 5 3 28 16 2
top NY03030 New York Municipal Police Los Angeles California July Murder or Manslaughter Male White Unknown Male 0 White Unknown Unknown Handgun FBI
freq 38416 38416 493026 44511 99783 58696 629338 494125 317422 368303 399541 216328 218243 446410 273013 317484 616647

Classes Table
The table below displays the attributes, its varying levels, the number of unique levels, and the number of unknown values.

The count of 'Unknown' values for several attributes is high. It is important to highlight that these 'Unknowns' are generally associated with a homicide not being solved. The fact that some agency names are 'Unknown' shows either a lack of attention to detail when capturing the details of a homicide, or a possible issue with jurisdiction for the case.

In [28]:
# get all levels per categorical attribute
df_categorical_levels = pd.DataFrame()
df_categorical_levels['Attribute'] = df_categorical.columns
df_categorical_levels['Levels'] = ''
df_categorical_levels['Levels_Count'] = ''
df_categorical_levels['Unknown_Count'] = ''

# populate the dataframe with categorical levels and count of each category
for i, row in df_categorical_levels.iterrows():
    attribute = row['Attribute']
    df_categorical_levels.at[i,'Levels'] = df[attribute].unique()
    df_categorical_levels.at[i,'Levels_Count'] = len(df[attribute].unique())
    try:
        df_categorical_levels.at[i,'Unknown_Count'] = df.groupby(attribute).count().loc['Unknown'][0]
    except: 
        df_categorical_levels.at[i,'Unknown_Count'] = 0     
In [29]:
# show the dataframe
df_categorical_levels.sort_values(by='Unknown_Count', ascending = False).head(17)
Out[29]:
Attribute Levels Levels_Count Unknown_Count
13 Perpetrator Ethnicity [Unknown, Not Hispanic, Hispanic] 3 446410
9 Victim Ethnicity [Unknown, Not Hispanic, Hispanic] 3 368303
14 Relationship [Acquaintance, Unknown, Wife, Stranger, Girlfriend, Ex-Husband, Brother, Stepdaughter, Husband, Sister, Friend, Family, Neighbor, Father, In-Law, Son, Ex-Wife, Boyfriend, Mother, Common-Law Husband, Common-Law Wife, Stepfather, Stepson, Stepmother, Daughter, Boyfriend/Girlfriend, Employer, Employee] 28 273013
12 Perpetrator Race [Native American/Alaska Native, White, Unknown, Black, Asian/Pacific Islander] 5 196047
10 Perpetrator Sex [Male, Unknown, Female] 3 190365
15 Weapon [Blunt Object, Strangulation, Unknown, Rifle, Knife, Firearm, Shotgun, Fall, Handgun, Drowning, Suffocation, Explosives, Fire, Drugs, Gun, Poison] 16 33192
8 Victim Race [Native American/Alaska Native, White, Black, Unknown, Asian/Pacific Islander] 5 6676
7 Victim Sex [Male, Female, Unknown] 3 984
1 Agency Name [Anchorage, Juneau, Nome, Bethel, North Slope Borough, Kenai, Alaska State Police, Jefferson, Bessemer, Birmingham, Fairfield, Gardendale, Leeds, Homewood, Brighton, Hueytown, Warrior, Mobile, Prichard, Chickasaw, Saraland, Satsuma, Montgomery, Autauga, Baldwin, Robertsdale, Daphne, Barbour, Blount, Bullock, Butler, Greenville, Calhoun, Anniston, Oxford, Chambers, Lafayette, Chilton, Choctaw, Clarke, Thomasville, Cleburne, Enterprise, Colbert, Tuscumbia, Conecuh, Coosa, Rockford, Andalusia, Crenshaw, Cullman, Dale, Ozark, Dallas, Selma, Collinsville, Elmore, Tallassee, Atmore, Brewton, Attalla, Gadsden, Fayette, Franklin, Red Bay, Geneva, Greene, Houston, Dothan, Jackson, Scottsboro, Lamar, Lauderdale, Florence, Lawrence, Lee, Auburn, Opelika, Limestone, Athens, Lowndes, Tuskegee, Madison, Huntsville, New Hope, Owens Crossroads, Marengo, Hackleburg, Marshall, Albertville, Morgan, Decatur, Hartselle, Perry, Marion, Pickens, Troy, Russell, Phenix City, St. Clair, ...] 9216 47
11 Perpetrator Age [15, 42, 0, 36, 27, 35, 40, 49, 39, 29, 19, 23, 33, 26, 41, 28, 61, 25, 7, 17, 34, 21, 43, 38, 66, 32, 37, 22, 30, 24, 65, 51, 60, 45, 64, 18, 20, 44, 73, 62, 69, 52, 16, 90, 56, 47, 57, 31, 46, 14, 55, 50, 54, 68, 77, 53, 67, 48, 76, 58, 71, 63, 89, 13, 75, 72, 93, 59, 10, 11, 79, 74, 99, 70, 78, 80, 9, 12, 81, 6, 87, 82, 83, 8, 84, 85, 86, 5, 3, 1, 88, 95, 4, 91, 92, 96, 94, 2, 98, 97] 100 0
0 Agency Code [AK00101, AK00103, AK00106, AK00113, AK00118, AK00123, AKASP00, AL00100, AL00101, AL00102, AL00104, AL00105, AL00106, AL00110, AL00118, AL00119, AL00123, AL00200, AL00201, AL00202, AL00203, AL00205, AL00207, AL00300, AL00301, AL00400, AL00500, AL00505, AL00508, AL00600, AL00800, AL00900, AL01000, AL01001, AL01100, AL01101, AL01105, AL01200, AL01201, AL01400, AL01500, AL01600, AL01602, AL01800, AL01902, AL02000, AL02002, AL02100, AL02200, AL02202, AL02301, AL02400, AL02500, AL02501, AL02600, AL02601, AL02700, AL02701, AL02802, AL02900, AL02902, AL03001, AL03002, AL03101, AL03102, AL03200, AL03201, AL03300, AL03303, AL03400, AL03500, AL03800, AL03801, AL03900, AL03901, AL04000, AL04100, AL04101, AL04200, AL04300, AL04301, AL04302, AL04400, AL04401, AL04500, AL04601, AL04700, AL04701, AL04704, AL04705, AL04800, AL04904, AL05000, AL05001, AL05200, AL05201, AL05202, AL05300, AL05301, AL05400, ...] 12003 0
6 Crime Type [Murder or Manslaughter, Manslaughter by Negligence] 2 0
5 Month [January, March, April, May, June, July, August, December, November, February, October, September] 12 0
4 State [Alaska, Alabama, Arkansas, Arizona, California, Colorado, Connecticut, District of Columbia, Delaware, Florida, Georgia, Hawaii, Iowa, Idaho, Illinois, Indiana, Kansas, Kentucky, Louisiana, Massachusetts, Maryland, Maine, Michigan, Minnesota, Missouri, Mississippi, Montana, Nebraska, North Carolina, North Dakota, New Hampshire, New Jersey, New Mexico, Nevada, New York, Ohio, Oklahoma, Oregon, Pennsylvania, Rhodes Island, South Carolina, South Dakota, Tennessee, Texas, Utah, Virginia, Vermont, Washington, Wisconsin, West Virginia, Wyoming] 51 0
3 City [Anchorage, Juneau, Nome, Bethel, North Slope, Kenai Peninsula, Jefferson, Mobile, Montgomery, Autauga, Baldwin, Barbour, Blount, Bullock, Butler, Calhoun, Chambers, Chilton, Choctaw, Clarke, Cleburne, Coffee, Colbert, Conecuh, Coosa, Covington, Crenshaw, Cullman, Dale, Dallas, De Kalb, Elmore, Escambia, Etowah, Fayette, Franklin, Geneva, Greene, Houston, Jackson, Lamar, Lauderdale, Lawrence, Lee, Limestone, Lowndes, Macon, Madison, Marengo, Marion, Marshall, Morgan, Perry, Pickens, Pike, Russell, St. Clair, Shelby, Sumter, Talladega, Tallapoosa, Tuscaloosa, Walker, Washington, Wilcox, Winston, Arkansas, Ashley, Benton, Boone, Bradley, Chicot, Clark, Clay, Columbia, Conway, Craighead, Crawford, Crittenden, Drew, Faulkner, Fulton, Garland, Grant, Hempstead, Hot Spring, Lafayette, Lincoln, Little River, Lonoke, Miller, Mississippi, Newton, Ouachita, Phillips, Poinsett, Polk, Prairie, Pulaski, St. Francis, ...] 1782 0
2 Agency Type [Municipal Police, County Police, State Police, Sheriff, Special Police, Regional Police, Tribal Police] 7 0
16 Record Source [FBI, FOIA] 2 0

Pivot Table of Sex, Race, Ethnicity, and Case Solved
Crime Solved equal to 0 indicates an unsolved case. There are few cases where 'Unknown' is used for Victim Sex, Race, and Ethnicity, revealing that victim demographics are more complete than perpetrator demographics. This is intuitive considering the difficulty of declaring a homicide without confirming the death of a victim. The higher percentage of perpetrator demographics captured as 'Unknown' reflects the large number of unsolved homicides. There are 189,786 records where the perpetrator's Sex, Race, and Ethnicity are all 'Unknown', whereas there are only 600 such records for victims. Generally, when the sex of a victim or a perpetrator is 'Unknown', the homicide is not solved.
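The triple-'Unknown' counts quoted above can be checked directly with a boolean mask over the three demographic columns. A minimal sketch on toy data (the column names match the report's df; the rows are illustrative only):

```python
import pandas as pd

# Toy rows standing in for the homicide data; column names mirror the report's df.
toy = pd.DataFrame({
    'Perpetrator Sex':       ['Unknown', 'Male',    'Unknown'],
    'Perpetrator Race':      ['Unknown', 'White',   'Unknown'],
    'Perpetrator Ethnicity': ['Unknown', 'Unknown', 'Unknown'],
})

# A record counts as "fully unknown" only when all three demographic fields are 'Unknown'.
perp_cols = ['Perpetrator Sex', 'Perpetrator Race', 'Perpetrator Ethnicity']
fully_unknown = (toy[perp_cols] == 'Unknown').all(axis=1).sum()
print(fully_unknown)  # 2
```

Swapping in the victim columns gives the corresponding victim count.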

In [30]:
df_vict = pd.pivot_table(df,index=['Victim Sex','Victim Race', 'Victim Ethnicity'], columns=['Crime Solved'], 
               values=['Record ID'], aggfunc='count').reset_index().rename(columns={'Victim Sex':'Sex','Victim Race':'Race', 
                                                                                    'Victim Ethnicity': 'Ethnicity', "Record ID":'Victim'})
df_perp = pd.pivot_table(df,index=['Perpetrator Sex','Perpetrator Race', 'Perpetrator Ethnicity'], columns=['Crime Solved'], 
               values=['Record ID'], aggfunc='count').reset_index().rename(columns={'Perpetrator Sex':'Sex','Perpetrator Race':'Race',
                                                                                    'Perpetrator Ethnicity': 'Ethnicity', "Record ID":'Perpetrator'})
In [31]:
df_pv = pd.merge(df_vict, df_perp, how = 'outer', on=['Sex','Race','Ethnicity']).set_index(['Sex','Race','Ethnicity'])
df_pv
Out[31]:
Victim Perpetrator
Crime Solved 0 1 0 1
Sex Race Ethnicity
Female Asian/Pacific Islander Hispanic 3.0 8.0 NaN 5.0
Not Hispanic 339.0 1238.0 NaN 309.0
Unknown 262.0 1103.0 NaN 263.0
Black Hispanic 25.0 140.0 NaN 68.0
Not Hispanic 4506.0 13249.0 NaN 9895.0
Unknown 9634.0 24529.0 3.0 14682.0
Native American/Alaska Native Hispanic NaN 1.0 NaN 2.0
Not Hispanic 103.0 415.0 NaN 234.0
Unknown 135.0 564.0 NaN 342.0
Unknown Hispanic 3.0 15.0 NaN 6.0
Not Hispanic 11.0 29.0 NaN 25.0
Unknown 530.0 764.0 NaN 372.0
White Hispanic 2108.0 7502.0 NaN 2395.0
Not Hispanic 5903.0 22347.0 NaN 7923.0
Unknown 9514.0 38365.0 2.0 12022.0
Male Asian/Pacific Islander Hispanic 7.0 34.0 NaN 43.0
Not Hispanic 1388.0 2603.0 NaN 3006.0
Unknown 917.0 1986.0 NaN 2400.0
Black Hispanic 345.0 703.0 NaN 805.0
Not Hispanic 25962.0 57251.0 NaN 65797.0
Unknown 60632.0 102882.0 37.0 123097.0
Native American/Alaska Native Hispanic 3.0 7.0 NaN 14.0
Not Hispanic 308.0 1184.0 NaN 1307.0
Unknown 370.0 1476.0 NaN 1696.0
Unknown Hispanic 52.0 106.0 NaN 150.0
Not Hispanic 33.0 95.0 NaN 179.0
Unknown 1853.0 2300.0 3.0 5170.0
White Hispanic 22545.0 39044.0 NaN 43364.0
Not Hispanic 13356.0 47165.0 NaN 56271.0
Unknown 28782.0 80736.0 19.0 96183.0
Unknown Asian/Pacific Islander Unknown 1.0 1.0 19.0 NaN
Black Not Hispanic 3.0 2.0 12.0 NaN
Unknown 18.0 18.0 118.0 NaN
Native American/Alaska Native Unknown 1.0 NaN 5.0 NaN
Unknown Hispanic 1.0 NaN 9.0 NaN
Not Hispanic 1.0 1.0 200.0 NaN
Unknown 600.0 282.0 189786.0 147.0
White Not Hispanic 6.0 1.0 12.0 NaN
Unknown 22.0 26.0 44.0 NaN
Asian/Pacific Islander Not Hispanic NaN NaN 1.0 NaN
Black Hispanic NaN NaN 2.0 NaN
Native American/Alaska Native Hispanic NaN NaN 1.0 NaN
Not Hispanic NaN NaN 1.0 NaN
White Hispanic NaN NaN 8.0 NaN

6. Visualize Attributes


Cities Where Homicides Were Committed
Mapping the cities where homicides occurred reveals that homicide is spread widely across the US. There are only small pockets of locations with no homicides in our dataset. The highest volume of homicides appears in central states such as Texas, western states such as California, and eastern states such as New York and New Jersey. Again, it's noticeable that the majority of high homicide counts correlate with concentrated urban areas such as Los Angeles, New York, and Houston/Dallas.

In [32]:
# create data frame for unique cities in the dataset
df_cities = pd.read_csv('../Data/city_locations.csv')

homicide_per_city = df.groupby(['City']).count().reset_index()
df_cities = homicide_per_city.merge(df_cities, on = 'City', how='inner')

fig = px.density_mapbox(df_cities[['City','Latitude','Longitude','Incident']], lat='Latitude', lon='Longitude', z='Incident', radius=15,
                        center=dict(lat=40, lon=-100), zoom=3, 
                        mapbox_style="stamen-terrain")
fig.show()

Victim and Perpetrator Relationship
Of all the homicides recorded, the majority of victim relationships to perpetrators were unknown, which, according to the attribute description, means the relationship could not be determined. It is also possible that the agency had no record of the relationship between victim and perpetrator.

The next two most common categories are Acquaintance and Stranger, which leads us to believe that perpetrators were likely killing people they barely knew, or that the altercation occurred by chance and may not have been a personal act of violence. The next highest category is Female Partner, potentially indicating domestic violence.
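The Relationship_Group column was derived earlier in the notebook; the mapping below is a hypothetical reconstruction of that collapse (group_map is inferred from the group names, not taken from the source), using map plus fillna so ungrouped levels keep their own label:

```python
import pandas as pd

# Hypothetical Relationship -> Relationship_Group mapping, inferred from the group names.
group_map = {
    'Wife': 'Female Partner', 'Girlfriend': 'Female Partner',
    'Ex-Wife': 'Female Partner', 'Common-Law Wife': 'Female Partner',
    'Husband': 'Male Partner', 'Boyfriend': 'Male Partner',
    'Ex-Husband': 'Male Partner', 'Common-Law Husband': 'Male Partner',
    'Son': 'Children', 'Daughter': 'Children',
    'Stepson': 'Children', 'Stepdaughter': 'Children',
    'Father': 'Parent', 'Mother': 'Parent',
    'Stepfather': 'Parent', 'Stepmother': 'Parent',
    'Brother': 'Sibling', 'Sister': 'Sibling',
    'Employer': 'Work', 'Employee': 'Work',
}

rel = pd.Series(['Wife', 'Stranger', 'Son', 'Acquaintance'])
# Levels absent from the map (Stranger, Acquaintance, ...) keep their own label.
grouped = rel.map(group_map).fillna(rel)
print(grouped.tolist())  # ['Female Partner', 'Stranger', 'Children', 'Acquaintance']
```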

In [33]:
#source code: https://stackoverflow.com/questions/46623583/seaborn-countplot-order-categories-by-count/46624802
sns.countplot(y='Relationship_Group',
              order=df['Relationship_Group'].value_counts().index,  # order the bars 
               data = df,
                color='#3498DB') 
plt.xlabel('Total count',size=14)
plt.ylabel('Relationship Type',size=14)
plt.title('Victims Relationship To Perpetrators',size=18) 
plt.show() 
In [34]:
# aggregate by relationship
df_r_grouped = df.groupby(by='Relationship_Group')
print (df_r_grouped['Relationship_Group'].count().nlargest(len(df_r_grouped)))
print ('---------------------------------------')
print (round(((df_r_grouped['Relationship_Group'].count() / records) * 100.0).nlargest(len(df_r_grouped)),2))
Relationship_Group
Unknown                 273013
Acquaintance            126018
Stranger                96593 
Female Partner          44102 
Friend                  21945 
Children                19367 
Male Partner            18688 
Parent                  13817 
Family                  9535  
Sibling                 6806  
Neighbor                6294  
Boyfriend/Girlfriend    1383  
Work                    893   
Name: Relationship_Group, dtype: int64
---------------------------------------
Relationship_Group
Unknown                 42.76
Acquaintance            19.74
Stranger                15.13
Female Partner          6.91 
Friend                  3.44 
Children                3.03 
Male Partner            2.93 
Parent                  2.16 
Family                  1.49 
Sibling                 1.07 
Neighbor                0.99 
Boyfriend/Girlfriend    0.22 
Work                    0.14 
Name: Relationship_Group, dtype: float64

Weapons
In the plot below, handguns were the most frequently used weapon, followed by knives. The high use of handguns could be related to an increased volume of guns being produced and purchased. According to Bureau of Justice Statistics, the number of guns produced in 1993 was 80% higher than in 1973. Another plausible reason could be an increased volume of stolen guns. Research conducted by the Virginia Department of Criminal Justice Services showed that in 1992 and 1993, 15% of adult offenders and 19% of juvenile offenders possessed or had stolen guns. The FBI also reported that from 1985 to 1994 an annual average of over 274,000 stolen guns was reported. Hence we can see why handguns were used the most. When gun types are grouped together and compared to all other weapons, they make up 65.9% of weapons used in homicides.

source: https://www.bjs.gov/content/pub/pdf/GUIC.PDF

In [35]:
sns.countplot(y='Weapon',
               order=df['Weapon'].value_counts().index,
              data=df,
              color='#432371') 
plt.xlabel('Total Count',size=14)
plt.ylabel('Weapon Used',size=14)
plt.title('Type of weapons used by Perpetrators',size=18) 
plt.show() 
In [36]:
weapon_dict = {
    'Guns': ['Handgun', 'Firearm', 'Shotgun', 'Rifle', 'Gun']
}

df['Weapon_Group'] = df['Weapon']
# map each weapon to its group label; any weapon not listed in weapon_dict keeps its own name
weapon_to_group = {weapon: group for group, weapons in weapon_dict.items() for weapon in weapons}
df['Weapon_Group'] = df['Weapon_Group'].map(lambda w: weapon_to_group.get(w, w))

# aggregate by weapon
df_grouped = df.groupby(by='Weapon_Group')
print (df_grouped['Weapon_Group'].count().nlargest(len(df_grouped)))
print ('---------------------------------------')
print (round(((df_grouped['Weapon_Group'].count() / records) * 100.0).nlargest(len(df_grouped)),2))
Weapon_Group
Guns             420739
Knife            94962 
Blunt Object     67337 
Unknown          33192 
Strangulation    8110  
Fire             6173  
Suffocation      3968  
Drugs            1588  
Drowning         1204  
Explosives       537   
Poison           454   
Fall             190   
Name: Weapon_Group, dtype: int64
---------------------------------------
Weapon_Group
Guns             65.90
Knife            14.87
Blunt Object     10.55
Unknown          5.20 
Strangulation    1.27 
Fire             0.97 
Suffocation      0.62 
Drugs            0.25 
Drowning         0.19 
Explosives       0.08 
Poison           0.07 
Fall             0.03 
Name: Weapon_Group, dtype: float64

Homicides by Year
The number of homicides in the US peaked in 1993, then started a downward trend. Forensic DNA analysis was introduced in the 1980s: the first reported use of DNA was in 1984 in England, and it wasn't until 1987 that DNA was used to convict a Florida rapist, 6 years before the spike in US homicides in 1993. The introduction of DNA may not have prevented the peak in 1993. In deeper analysis below we'll look to see if DNA had an impact on the number of solved versus unsolved cases.

source: https://www.easydna.co.uk/knowledge-base/history-of-forensic-dna-analysis
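One way to frame the DNA question for the deeper analysis is to track the yearly clearance rate rather than raw counts; the mean of a 0/1 Crime Solved indicator per year gives that rate directly. A minimal sketch on toy data (the toy frame stands in for df):

```python
import pandas as pd

# Toy Year / Crime Solved columns (1 = solved) standing in for the full df.
toy = pd.DataFrame({
    'Year':         [1985, 1985, 1993, 1993, 1993, 2000],
    'Crime Solved': [1,    0,    1,    0,    0,    1],
})

# Mean of the 0/1 indicator per year = that year's solve (clearance) rate.
solve_rate = toy.groupby('Year')['Crime Solved'].mean()
print(solve_rate.round(2))
```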

In [37]:
df_yr = pd.DataFrame(df.groupby('Year').count()['Record ID'])
df_yr = df_yr.rename(columns={'Record ID':'Homicide Count'}).reset_index()
fig = px.line(df_yr, x='Year', y='Homicide Count', title='Homicides by Year')
fig.show() 

Victim and Perpetrator Sex
There is a discrepancy between the victim and perpetrator sex percentages. A victim's sex is unknown less than 1% of the time, compared to roughly 30% for perpetrators. Female perpetrators are about 14% less common than female victims, and the percentages of male victims and male perpetrators likewise differ by about 14%.

In addition, the charts below reveal that the majority of homicides are committed by males against other males.

In [38]:
# Percentage of Victim and Perpetrator sex
v_sex = df['Victim Sex'].value_counts()
p_sex = df['Perpetrator Sex'].value_counts()

fig, axes = plt.subplots(nrows=1,ncols=2)
fig.set_size_inches(12, 12)
v_sex.plot.pie(autopct='%1.0f%%', explode=(.1,.1,.1), shadow=True, 
              startangle=0,ax=axes[0])
p_sex.plot.pie(autopct='%1.0f%%', explode=(.1,.1,.1), shadow=True, 
              startangle=0,ax=axes[1])
Out[38]:
<matplotlib.axes._subplots.AxesSubplot at 0x1a2d463490>

Agency Type
Municipal police (or local city police) initially handle homicide cases because they are responsible for their local jurisdiction. Only under special circumstances, or if jurisdiction is in question, will other agency types take ownership of working a homicide.

In [39]:
sns.countplot(y='Agency Type',
               order=df['Agency Type'].value_counts().index,
              data=df,
              color='#432371') 
plt.xlabel('Total Count',size=14)
plt.ylabel('Agency Type',size=14)
plt.title('Type of Agency That Work The Homicide',size=18) 
plt.show() 

7. Explore Joint Attributes


Pairs Matrix
Reviewing the attribute matrix below, no initial correlation between attributes is visible. It's notable that Victim Age is skewed by the value 998, which encodes an unknown age.
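Assuming 998 is the sentinel used for an unknown Victim Age (an assumption; this excerpt does not define the code), the skew can be seen and removed by masking the sentinel before summarizing. A toy sketch:

```python
import pandas as pd

# Illustrative ages; 998 is assumed to be the unknown-age sentinel.
ages = pd.Series([23, 30, 998, 25, 998])

print(ages.mean())               # 414.8 -- inflated by the sentinel
print(ages[ages != 998].mean())  # 26.0  -- after masking the sentinel
```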

In [40]:
df_matrix = df
fig = px.scatter_matrix(df[['Year', 'Incident', 'Victim Age', 'Victim Count','Perpetrator Count']])
fig.show()

Correlation Plot
Supplemental to the pairs matrix and correlation plots referred to earlier in the report, the table below reveals the correlations between attributes. Supporting the lack of correlation, all variables have weak relationships, if any, to one another, as indicated by the R^2 values. While there is a potential relationship between Crime Solved and Perpetrator Count, it's not strong enough to be considered problematic for statistical models.

In [41]:
# Correclation plot
corr_df = df.corr()
sns.heatmap(corr_df, cmap="YlGnBu", annot=True)
plt.title("Correlation matrix of Homicide data", y=-0.75)
plt.xlabel("Continuous variables")
plt.ylabel("Continuous variables")
plt.show()

Agency Type and States
The chart below breaks homicides apart by state and agency type. Again, Texas, California, and New York are the top contenders, supporting the prior analysis. Municipal police continue to be the primary agency reporting homicides, with sheriffs reporting a significant number in California and Texas. Despite New York having high homicide counts, the number of cases reported by sheriffs there is relatively low.

In [42]:
ct = pd.crosstab(df['State'],
                  df['Agency Type']) 
ct.plot(kind='barh', 
        stacked=True,
         width=0.8,  
         figsize=(10,20) # (x-axis,y-axis)
         )
plt.xlabel('Total Count',size=14)
plt.ylabel('State',size=14)
plt.title('State and Agency Type Homicide Counts',size=18)
plt.show() 

Agency Type and Crime Type
The analysis below shows the majority of violence from 1980-2014 was murder or manslaughter reported by municipal police. Nearly all homicides reported fall under 'Murder or Manslaughter', suggesting that the homicides were intentional, and potentially premeditated. The low number of manslaughter-by-negligence cases is curious, considering that manslaughter and manslaughter by negligence are categorized separately. Manslaughter in this dataset is coupled with murder, while the legal definitions distinguish the two levels by negligence versus recklessness. This is because recklessness assumes knowledge that the act was potentially harmful, regardless of intention, while negligence does not. The line between the two can easily be argued in court, but an argument for recklessness appears to be the stronger of the two, assuming that 50% of the 'Murder or Manslaughter' level is manslaughter.

In [43]:
ct = pd.crosstab(df['Agency Type'],
                  df['Crime Type']) 
ct.plot(kind='barh', 
         #stacked=True, 
         width=0.8,  
         figsize=(8,6) # (x-axis,y-axis)
         )
plt.xlabel('Total Count',size=14)
plt.ylabel('Agency Type',size=14)
plt.title('Murder or Manslaughter accounts for the majority of crime',size=18)
plt.show() 

Predicting Perpetrator Age from Victim Age

Perpetrators tend to kill victims in the same age group. Most homicides involve an 11-40 year old perpetrator killing an 11-40 year old victim. The most frequent pairing is a perpetrator aged 21-30 and a victim aged 21-30. This is supported by the categorical statistics discussed earlier, where the difference between the average ages is roughly 10 years. Combined with the data showing that the majority of known relationships are acquaintances, people in their teens and 20s are the most likely to murder acquaintances.

In [44]:
perp_not0 = df.loc[df['Perpetrator Age']!=0, ['Perpetrator_Age_Group','Victim_Age_Group']]
ct=pd.crosstab(perp_not0['Perpetrator_Age_Group'],perp_not0['Victim_Age_Group'])
sns.heatmap(ct)
plt.title('Homicide Rates by Perpetrator & Victim Age')
Out[44]:
Text(0.5, 1, 'Homicide Rates by Perpetrator & Victim Age')

Relationship, Victim Sex, Perpetrator Sex

There is ambiguity in the meaning of the 'Relationship' variable. If a homicide's relationship is 'Girlfriend', is the victim the girlfriend of the perpetrator? The answer is yes: from the 'Female Partner' subplot, note that most victims are female while most perpetrators are male, and from the 'Male Partner' subplot, most victims are male while most perpetrators are female. Therefore, 'Relationship' can be interpreted as the victim's relationship to the perpetrator.
For most relationships, a male kills another male. The only anomaly is when the victim was the female partner. Therefore, if a victim is female and is known to be in a marriage or serious relationship, authorities should investigate whether the perpetrator is the male partner.

In [45]:
#https://jakevdp.github.io/PythonDataScienceHandbook/04.08-multiple-subplots.html


fig=plt.figure()
fig.set_size_inches(16, 16)
fig.subplots_adjust(hspace=0.4, wspace=0.4)


relationshipz=df['Relationship_Group'].unique()
for i in range(1,len(relationshipz)):
  relationship=relationshipz[i]  
  ct=pd.crosstab(df.loc[df['Relationship_Group']==relationship,'Victim Sex'],
               df.loc[df['Relationship_Group']==relationship,'Perpetrator Sex'])
  ax=fig.add_subplot(4,4,i)
  sns.heatmap(ct,ax=ax)
  plt.title(relationship)

Victim Sex and Perpetrator Sex
The heat map below shows that the homicide rate is highest when both perpetrator and victim are male, again supporting the observations from the prior pie charts. The majority of all homicides, regardless of perpetrator sex, are committed against males, indicating that males are the most frequently targeted sex in homicides.

In [46]:
ct = pd.crosstab(df['Victim Sex'],
                  df['Perpetrator Sex'],
                         normalize=True, 
                        margins=True).round(3)*100  
In [47]:
plt.figure(figsize=(6,4)) 
cmp = sns.cubehelix_palette(start=0,light=1,as_cmap=True)
sns.heatmap(pd.crosstab(df['Perpetrator Sex'],df['Victim Sex'], normalize=True)*100,
            annot=True,fmt=".4f",linewidths=.5,cmap='coolwarm')

plt.xlabel('Victim Sex',size=10)
plt.ylabel('Perpetrator Sex',size=10)
plt.title('Same Sex Crime Were The Highest',size=14,color='black') 
plt.show() 

Victim Race and Victim Sex
The table below displays that the majority of victims were of White and Black race, respectively.

Another interesting observation from this table is that white females, at 59.8%, were the largest group among female victims of any race. Again, to ensure a well-rounded picture, population and demographic data should be compared to understand the higher rates of homicide. While the percentage below is alarming, an understanding of homicides by race in relation to the entire population of the US may reveal a different narrative.

In [48]:
# Check to see which race of population who are victim of crime based on race.
ct = pd.crosstab(df['Victim Race'],
                         df['Victim Sex'],
                        normalize='columns').round(3)*100  
ct.T 
Out[48]:
Victim Race Asian/Pacific Islander Black Native American/Alaska Native Unknown White
Victim Sex
Female 2.1 36.3 0.8 0.9 59.8
Male 1.4 50.1 0.7 0.9 46.9
Unknown 0.2 4.2 0.1 89.9 5.6

Weapon and Perpetrator Race
The heatmap suggests that among known races, the majority of perpetrators who used a handgun were Black or White. Supplemental to prior inferences, handguns are the top weapon with respect to homicide, followed by knives and blunt objects. Again, population data should be consulted, as the visual below does not provide sufficient information for a well-rounded conclusion.

In [49]:
ct = pd.crosstab(df['Perpetrator_Race_Ethnicity'], 
                 df['Weapon'],
                    normalize=True ).round(4)*100 
In [50]:
plt.figure(figsize=(12,8))
wphmp = sns.cubehelix_palette(start=0,light=1,as_cmap=True)
sns.heatmap(ct,annot=True,cmap='gist_heat',fmt='.2f',linewidths=.0)
plt.xlabel('Type Of Weapon',size=14)
plt.ylabel('Race/Ethnicity Of Perpetrator',size=14)
plt.title('Weapons by Perpetrator Race',fontsize=18,color='black')
plt.show()

Weapon and Perpetrator Sex
The heatmap of weapon by sex suggests women primarily use handguns and knives in homicides. Research conducted on a high-risk sample of 510 females aged 14-17 showed that knives were the most frequently used weapon among women. Overall, handguns are the most popular weapon across sexes, followed by knives. There is no major trend with respect to sex, other than that males tend to use a wider variety of weapons.

Source: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2438581/

In [51]:
persex_wep_ct = pd.crosstab(df['Weapon'],
                  df['Perpetrator Sex'],
                    normalize= True).round(3)*100 
plt.figure(figsize=(12,8))
wphmp = sns.cubehelix_palette(start=0,light=1,as_cmap=True)
sns.heatmap(persex_wep_ct,annot=True,cmap='PuRd_r',fmt='.2f',linewidths=.0)
plt.xlabel('Perpetrator Sex',size=14)
plt.ylabel('Weapon Used',size=14)
plt.title('Weapon by Sex',fontsize=18,color='#FF1493')
plt.show() 

Perpetrator Ages and Sex
Male perpetrators in the 21-30 age group are responsible for the majority of homicides. It should be noted that many ages and sexes are unknown (recorded as age 0). The peak age for committing homicide appears to be 21-30, after which the trend declines with increasing age.

In [52]:
ct = pd.crosstab(df.Perpetrator_Age_Group,
           df['Perpetrator Sex'],    
           normalize=True).round(4)*100

ct.plot(kind='bar',
        stacked=True)
plt.xlabel('Perpetrator Age Range')
plt.ylabel('Percentage of Total Homicides')
plt.title('Age and Sex of Perpetrators')
plt.show()

Perpetrator Race
The data suggests that Asian/Pacific Islander perpetrators, primarily male, tended to work with additional perpetrators when carrying out homicides. While Asian/Pacific Islanders have the highest average count, the variation between races is not large. The data also suggests that Native American/Alaska Native women are among the most likely, compared to women of other races, to work in a group when carrying out a homicide.

In [53]:
sns.catplot(x='Perpetrator Race',
           y='Perpetrator Count',
           kind='bar', 
           height=6,
            aspect=2,
            hue='Perpetrator Sex', 
            palette=['#432371',"#FAAE7B",'red'],
           data=df.sort_values('Perpetrator Race'))
plt.xlabel('Perpetrator Race',size=14)
plt.ylabel('Average of Perpetrator Count',size=14)
plt.title('On average, male Asian/Pacific Islander perpetrator counts were the highest',size=18)
plt.show()

8. Explore Attributes and Class


Weapon Vs Crime Solved
The data below suggests that most unsolved crimes with an identified weapon were committed with handguns, knives, and blunt objects. Handguns alone account for 50% of homicides. Handguns have around a 60% chance of the crime being solved, while knives have 20%, firearms have 50%, and blunt objects have 20%.
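The per-weapon solve percentages quoted above are row proportions; the crosstab below uses normalize=True (shares of the grand total), so each weapon's own solve rate comes from normalize='index' instead. A sketch on toy data:

```python
import pandas as pd

# Toy records; Crime Solved uses 1 = solved, 0 = unsolved as in the report.
toy = pd.DataFrame({
    'Weapon':       ['Handgun', 'Handgun', 'Handgun', 'Knife', 'Knife'],
    'Crime Solved': [1,         1,         0,         1,       0],
})

# normalize='index' makes each row sum to 1, i.e. per-weapon solve/unsolve rates.
rate = pd.crosstab(toy['Weapon'], toy['Crime Solved'], normalize='index')
print(rate.loc['Handgun', 1])  # two of three handgun cases solved
```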

In [54]:
ct_weapon= pd.crosstab(df['Weapon'],
           df['Crime Solved'],
           normalize=True)

ct_weapon.rename(columns={0:'N',1:'Y'}).plot(kind='barh',
               stacked=True,
               width=0.8,  
               figsize=(8,6) # (x-axis,y-axis)
                        )
plt.xlabel('Percentage of Total Homicides',size=13)
plt.ylabel('Weapon',size=13)
plt.title('Relation between weapon used and homicide percent based on crime solved',size=16)  
plt.show() 

Homicide Solved/Unsolved per Year
There was a drop in the total number of homicides in 1984, followed by a sharp increase in the national homicide count between 1990 and 1994. The plot below suggests that homicide counts were highest in 1990-1994, with 1993 the peak year. There is a small increase in 2005 and 2006 before the downward trend resumes through 2014. Despite the fluctuating number of records, the gap between solved and unsolved cases follows the same overall rises and falls, without any major drops or jumps.

In [55]:
df_homicides_per_year = df.groupby('Year').count().reset_index()
df_homicides_per_year_solved = df.groupby(['Year', 'Crime Solved']).count().reset_index()

# Create traces
solved_y = df_homicides_per_year_solved.loc[df_homicides_per_year_solved['Crime Solved'] == 1]
unsolved_y = df_homicides_per_year_solved.loc[df_homicides_per_year_solved['Crime Solved'] == 0]
fig = go.Figure()
fig.add_trace(go.Scatter(x=df_homicides_per_year['Year'], y=df_homicides_per_year['Record ID'],
                    mode='lines+markers',
                    name='Homicide Rates'))
fig.add_trace(go.Scatter(x=solved_y['Year'], y=solved_y['Record ID'],
                    mode='lines+markers',
                    name='Solved Homicides'))
fig.add_trace(go.Scatter(x=unsolved_y['Year'], y=unsolved_y['Record ID'],
                    mode='lines+markers',
                    name='Unsolved Homicides'))
fig.update_layout(
    title={
        'text': "Homicides Per Year",
        'y':0.9,
        'x':0.5,
        'xanchor': 'center',
        'yanchor': 'top'},
    xaxis_title="Year",
    yaxis_title=" Number of Homicides")

fig.show()

Victim Age/Sex Crime Solved Over Time
The majority of homicide victims are men in their teens and 20s. Across all years, the number of female homicide victims is less than half the number of male victims. There appear to be two peaks in the number of homicides: the first, at victim age 0, accounts for around 100 to 200 infant homicides every year; the second, centered around victim age 23, appears right-skewed. In summary, victims tend to be males in their 20s or 30s, and there is a consistent number of under-one-year-old victims every year.

The different marker shapes signify the Crime Solved status of each set of homicides. The victim age with the greatest difference between solved and unsolved crime is the late 20s, which is also the peak of the distribution. The number of unsolved female-victim homicides is roughly half the number of solved female-victim homicides, while the number of unsolved male-victim homicides appears to be only a third of the solved male-victim homicides.

In [56]:
# Victim (Sex, Age) vs Crime Solved
df_victime_gender = df.groupby(['Victim Age', 'Victim Sex', 'Year', 'Crime Solved']).count().reset_index()
fig=px.scatter(df_victime_gender, x="Victim Age", y="Record ID", animation_frame="Year", animation_group="Victim Age",
           size="Record ID", color="Victim Sex", hover_name="Record ID", symbol = 'Crime Solved',
           log_x=False, size_max=20, range_x=[0,100], range_y=[0,1200])
fig.update_yaxes(title_text='Number of Homicides')
fig.update_layout(
    title="Sex and Age of Victims Across Time Based on Crime Solved")
fig.show()

Relationship Group and Crime Solved
Below, having an unknown Relationship has the lowest probability of a crime being solved. This is intuitive, considering that crimes in which little is known about the victims or perpetrators are less likely to be solved. Of the remaining relationship groups, Stranger has the lowest solve probability, at around 85%.

In [57]:
ct_relationship = pd.crosstab(df['Relationship_Group'],
           df['Crime Solved'],
           normalize=True)

#ct_relationship.loc[~ct_relationship.index.isin(['Unknown']),:].rename(columns={0:'N',1:'Y'}).plot(kind='barh',stacked=True)
ct_relationship.rename(columns={0:'N',1:'Y'}).plot(kind='barh',
                            stacked=True,
                           )
plt.xlabel('Percentage of Total Homicides')
plt.title('Probability of Crime Solved Relative to Relationship Group')
Out[57]:
Text(0.5, 1.0, 'Probability of Crime Solved Relative to Relationship Group')
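The per-group solve rate the text describes comes from row-wise normalization; a minimal sketch on toy data (synthetic rows, not the homicide dataset) shows how `normalize='index'` makes each row sum to 1, giving P(solved | relationship group):

```python
import pandas as pd

# Toy data (hypothetical, for illustration only)
toy = pd.DataFrame({
    'Relationship_Group': ['Stranger', 'Stranger', 'Family', 'Family', 'Family'],
    'Crime Solved': [1, 0, 1, 1, 0],
})

# normalize='index' makes each row sum to 1: P(solved | group)
ct = pd.crosstab(toy['Relationship_Group'], toy['Crime Solved'], normalize='index')
print(ct.round(2))
```

With `normalize=True` instead, each cell would be a share of all homicides, so groups with few records would look uniformly small regardless of their solve rate.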

Agency Type and Crime Solved
Municipal Police and Sheriff agencies report nearly 90% of all homicides. Municipal Police have the lowest solve rate, roughly 66%, while Sheriffs solve about 80% of their cases.

Agency Type may be a good feature to be used as a predictor for Crime Solved due to the disparity in crime solving percentage between the two predominant Agency Types.

In [58]:
agency_ct = pd.crosstab(df['Agency Type'],df['Crime Solved'],normalize=True)
agency_ct.rename(columns={0:'N',1:'Y'}).plot(kind='barh',stacked=True)
Out[58]:
<matplotlib.axes._subplots.AxesSubplot at 0x1a2cd96590>

Crime Per Season
In analyzing crimes solved per season, solved crimes increase in summer and decrease in winter. Note that overall crime counts also rise in summer and gradually drop toward winter. This suggests that as average temperatures increase, homicides increase, and they fall as temperatures drop.

In [59]:
winter = ['December', 'January', 'February']
spring = ['March','April', 'May']
summer = ['June', 'July', 'August', 'September']
fall = ['October', 'November']
df['Season'] = ''
df['Season']=np.where(df.Month.isin(winter),'Winter',df.Season);
df['Season']=np.where(df.Month.isin(spring),'Spring',df.Season);
df['Season']=np.where(df.Month.isin(summer),'Summer',df.Season);
df['Season']=np.where(df.Month.isin(fall),'Fall',df.Season);

pd.crosstab(df.Season,df['Crime Solved']).plot(kind='bar')
plt.title('Crime Frequency Per Season',size=18)
plt.xlabel('Season',size=14)
plt.ylabel('Frequency Of Crime',size=14)
plt.show()
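The chained `np.where` calls above can also be written as a single dictionary lookup; a sketch using the same month-to-season grouping as above (September kept in Summer), applied to a stand-in for `df['Month']`:

```python
import pandas as pd

season_map = {
    'December': 'Winter', 'January': 'Winter', 'February': 'Winter',
    'March': 'Spring', 'April': 'Spring', 'May': 'Spring',
    'June': 'Summer', 'July': 'Summer', 'August': 'Summer', 'September': 'Summer',
    'October': 'Fall', 'November': 'Fall',
}

# Stand-in for df['Month']; one .map call replaces the four np.where lines
months = pd.Series(['January', 'April', 'July', 'October'])
seasons = months.map(season_map)
print(seasons.tolist())  # ['Winter', 'Spring', 'Summer', 'Fall']
```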

9. Exceptional Work


Display the dummy coded dataset.

In [60]:
try:
    df_full = df_dummy
except:
    df_full = pd.read_csv('../Data/Dummy_coded_database.csv')
    df_full = df_full.drop('Unnamed: 0', axis=1)
df_full.head()
Out[60]:
Agency Type_County Police Agency Type_Municipal Police Agency Type_Regional Police Agency Type_Sheriff Agency Type_Special Police Agency Type_State Police Agency Type_Tribal Police City_Abbeville City_Acadia City_Accomack ... Record Source_FOIA Record ID Year Incident Crime Solved Victim Age Victim Count Perpetrator Count Agency Name Agency Code
0 0 1 0 0 0 0 0 0 0 0 ... 0 1 1980 1 1 14 0 0 Anchorage AK00101
1 0 1 0 0 0 0 0 0 0 0 ... 0 2 1980 1 1 43 0 0 Anchorage AK00101
2 0 1 0 0 0 0 0 0 0 0 ... 0 3 1980 2 0 30 0 0 Anchorage AK00101
3 0 1 0 0 0 0 0 0 0 0 ... 0 4 1980 1 1 43 0 0 Anchorage AK00101
4 0 1 0 0 0 0 0 0 0 0 ... 0 5 1980 2 0 30 0 1 Anchorage AK00101

5 rows × 2031 columns

9.1 Downsampling the Data

Downsampling is used to balance the data (Resource). The dataset is skewed toward solved cases (roughly 30% unsolved and 70% solved). Balancing the classes gives the model a better chance of achieving both high specificity and high sensitivity.

The data is first split into a test/train set before being downsampled. This allows for a percentage of the less populated level to be retained in the test set for validation purposes. The training set is downsampled, then removed from the original data to generate the final train/test splits.
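Why balance at all? With a 70/30 class split, a trivial classifier that always predicts "solved" already scores 70% accuracy while catching zero unsolved cases; a quick sketch with synthetic labels:

```python
import numpy as np

# Synthetic labels mirroring the ~70/30 solved/unsolved split
y = np.array([1] * 70 + [0] * 30)
y_pred = np.ones_like(y)  # always predict the majority class ("solved")

accuracy = (y_pred == y).mean()
specificity = (y_pred[y == 0] == 0).mean()  # true negative rate
print(accuracy, specificity)  # 0.7 0.0
```

This is why accuracy alone is an unreliable metric on imbalanced data, and why sensitivity and specificity are tracked separately below.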

In [61]:
# set seed
random.seed(1234)
df_clean = df_full.drop(['Agency Name', 'Agency Code'], axis=1)
# split into train/test
y = df_clean['Crime Solved']
x = df_clean.drop(['Crime Solved'], axis = 1)
x_train,x_test,y_train,y_test=train_test_split(x,y,test_size=0.8)
In [62]:
# check for a balanced dataset
df_crime = df_full[['Crime Solved', 'Record ID']].groupby('Crime Solved').count().reset_index().rename(columns={'Record ID':'Count'})
df_crime['Solved'] = ['No', 'Yes']
df_crime = df_crime.drop('Crime Solved', axis=1)
total = df_crime['Count'].sum()
df_crime['Percentage'] = [x/total for x in df_crime['Count']]
df_crime.head()
Out[62]:
Count Solved Percentage
0 190282 No 0.298036
1 448172 Yes 0.701964
In [63]:
training_df = x_train.copy()  # copy to avoid mutating x_train in place
training_df['Crime Solved'] = y_train
In [64]:
# Separate majority and minority classes
df_majority = training_df[training_df['Crime Solved']==1]
df_minority = training_df[training_df['Crime Solved']==0]
 
# Downsample majority class
df_majority_downsampled = resample(df_majority, 
                                 replace=False,    # sample without replacement
                                 n_samples=len(df_minority),     # to match minority class
                                 random_state=123) # reproducible results
 
# Combine minority class with downsampled majority class
df_downsampled = pd.concat([df_majority_downsampled, df_minority])
In [65]:
# Display new class counts
df_downsampled.groupby('Crime Solved').count()['Agency Type_County Police']
Out[65]:
Crime Solved
0    38156
1    38156
Name: Agency Type_County Police, dtype: int64

Create a new split based off the balanced training sets

Due to the large dataset size, rather than doing a traditional train/test split, the method below resulted in quicker computation and prevented computational timeouts.
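The index-mask idea used below can be illustrated on toy arrays (stand-ins for `df_full.index` and the downsampled training index):

```python
import numpy as np

full_ind = np.arange(10)            # stand-in for df_full.index.values
train_ind = np.array([1, 3, 5])     # stand-in for the downsampled training index

# invert=True keeps rows NOT in the training index, i.e. the test set
mask = np.isin(full_ind, train_ind, invert=True)
test_ind = full_ind[mask]
print(test_ind.tolist())  # [0, 2, 4, 6, 7, 8, 9]
```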

In [66]:
# df_training = pd.read_csv("../Data/training_data_set.csv")
df_training = df_downsampled.reset_index()
index_train = df_training['index']
df_training = df_training.drop('index', axis=1)
In [67]:
# index_train.values
full_ind = df_full.index.values
train_ind = index_train.values
mask = np.isin(full_ind, train_ind, invert=True)
test_ind = full_ind[mask]
print('actual no. of records: ' + str(len(test_ind))+', expected no. of records: ' + str(len(df_full) - len(index_train.values) ))
if len(test_ind) == (len(df_full) - len(index_train.values)):
    df_test = df_full.iloc[mask]
    print('Validation Complete')
actual no. of records: 562142, expected no. of records: 562142
Validation Complete

Final Test/Train Split

In [68]:
x_train = df_training.drop('Crime Solved', axis=1)
x_test =  df_test.drop('Crime Solved', axis=1)
y_train = df_training['Crime Solved']
y_test = df_test['Crime Solved']
9.2 Principal Component Analysis

  1. PCA is only meaningful on continuous data. While it can be run on dummy-coded categorical variables, PCA is intended for continuous attributes, and deviating from this can lead to misleading conclusions.

  2. Since PCA is sensitive to scale, the first step is to standardize the data (Resource).

In [69]:
continuous_col = list(df.describe().columns.values)
continuous_col.remove('Crime Solved')
pca_train = df[continuous_col]
pca_x_train = pca_train.loc[train_ind]
pca_x_train.head()
Out[69]:
Record ID Year Incident Victim Age Perpetrator Age Victim Count Perpetrator Count
341627 341628 1996 1 36 45 2 0
288270 288271 1993 27 30 45 0 0
247784 247785 1992 1 45 45 1 0
413530 413531 2000 1 22 34 0 0
573273 573274 2010 1 27 25 0 0

From the graph below of PC1 vs. PC2, note the strong overlap between solved and unsolved cases. The first two principal components do not separate the two classes, suggesting that the continuous variables alone do not meaningfully distinguish solved from unsolved homicides.

In [70]:
# Standardizing the features
x = StandardScaler().fit_transform(pca_x_train)

pca = PCA(n_components=6)
principalComponents = pca.fit_transform(x)
principalDf = pd.DataFrame(data=principalComponents, columns=['PCA_' + str(i) for i in range(6)])
# Align the target with the sampled rows before concatenating
df_PCA = pd.concat([principalDf, y.loc[train_ind].reset_index(drop=True)], axis=1)

# create visual
fig = px.scatter(principalComponents, x=df_PCA['PCA_0'], y=df_PCA['PCA_1'], color=df_PCA['Crime Solved'],
                width=600, height=300)
fig.update_layout(title='PCA 1 vs. PCA 2',
                  yaxis_zeroline=False, xaxis_zeroline=False)
fig.update_xaxes(title_text='PCA 1')
fig.update_yaxes(title_text='PCA 2')
fig.show()
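A diagnostic not shown above is the explained variance ratio, which reports how much of the total variance each component captures. A self-contained sketch on synthetic data (random columns with one injected correlation, standing in for the scaled continuous attributes):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))                             # stand-in for the continuous columns
X[:, 1] = 2 * X[:, 0] + rng.normal(scale=0.1, size=200)   # inject correlation

X_std = StandardScaler().fit_transform(X)
pca = PCA(n_components=6).fit(X_std)

# Ratios are sorted in decreasing order and sum to 1
print(pca.explained_variance_ratio_.round(3))
```

If the first two ratios are small, a 2-D scatter of PC1 vs. PC2 necessarily discards most of the variance, which is another reason class overlap in such a plot should be interpreted cautiously.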

9.3 Feature Selection

The feature selection method used here is Recursive Feature Elimination (RFE), which recursively removes features until only the recommended set remains.

First, feature selection is run on the attributes without dummy coding to narrow the field to the top 5 features (Resource). Those top features are then run through RFE to eliminate any remaining unnecessary attributes (Resource). This two-stage approach is used primarily because of the dataset's size: narrowing the features before dummy coding lowers the computational cost and returns results faster. The dummy-coded columns corresponding to the top attributes are then passed to RFE before fitting the logistic regression.

In [71]:
# These helper functions are adapted from the referenced resource
# load the dataset
def load_dataset(filename):
    # load the dataset as a pandas DataFrame
    data = read_csv(filename, header=None)
    # retrieve numpy array
    dataset = data.values
    # split into input (X) and output (y) variables
    X = dataset[:, :-1]
    y = dataset[:,-1]
    # format all fields as string
    X = X.astype(str)
    return X, y

# prepare input data
def prepare_inputs(X_train, X_test, all_test):
    oe = OrdinalEncoder()
    oe.fit(all_test)
    X_train_enc = oe.transform(X_train)
    X_test_enc = oe.transform(X_test)
    return X_train_enc, X_test_enc

# prepare target
def prepare_targets(y_train, y_test):
    le = LabelEncoder()
    le.fit(y_train)
    y_train_enc = le.transform(y_train)
    y_test_enc = le.transform(y_test)
    return y_train_enc, y_test_enc

# feature selection
def select_features(X_train, y_train, X_test):
    fs = SelectKBest(score_func=chi2, k='all')
    fs.fit(X_train, y_train)
    X_train_fs = fs.transform(X_train)
    X_test_fs = fs.transform(X_test)
    return X_train_fs, X_test_fs, fs
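A small synthetic example of what select_features computes: chi2 scores a feature that tracks the label much higher than an independent one (toy arrays, not the homicide data):

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2

# Column 0 perfectly tracks the label; column 1 is unrelated to it
X = np.array([[1, 0], [1, 1], [1, 0],
              [0, 1], [0, 0], [0, 1]])
y = np.array([1, 1, 1, 0, 0, 0])

fs = SelectKBest(score_func=chi2, k='all').fit(X, y)
print(fs.scores_)  # the first score is the larger one
```

With k='all', no columns are dropped; the fitted scores_ are simply ranked, which is how the top attributes are selected below.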

The graph below suggests that the top five attributes for predicting whether a homicide is solved are

  1. Agency Code
  2. Perpetrator Age
  3. Relationship
  4. Sex
  5. Perpetrator Count

Next, using logistic regression and RFE, the attributes will be reduced further.

In [72]:
df_logit = df.drop(['Record ID', 'Incident', 'Victim_Age_Group','Perpetrator_Age_Group', 'Victim_Race_Ethnicity','Perpetrator_Race_Ethnicity', 
    'City_State', 'Month_Year','Relationship_Group'], axis=1)

x_df = df_logit.drop(['Crime Solved'], axis=1)
y_df = df_logit['Crime Solved']
x_train_m,x_test_m,y_train_m,y_test_m=train_test_split(x_df,y_df,test_size=0.8)

x_train_m['Crime Solved'] = y_train_m
df_majority_m = x_train_m[x_train_m['Crime Solved']==1]
df_minority_m = x_train_m[x_train_m['Crime Solved']==0]
 
# Downsample majority class
df_majority_downsampled_m = resample(df_majority_m, 
                                 replace=False,    # sample without replacement
                                 n_samples=len(df_minority_m),     # to match minority class
                                 random_state=123) # reproducible results

# Combine the downsampled majority class with the minority class
df_balanced_m = pd.concat([df_majority_downsampled_m, df_minority_m])

# Anti-join: everything not in the balanced training set becomes the test set
df_test_m = df_logit.merge(df_balanced_m, how='outer', indicator=True).loc[lambda x: x['_merge'] == 'left_only']

x_m = df_balanced_m.drop('Crime Solved', axis=1).astype(str)
y_m = df_balanced_m['Crime Solved'].astype(str)
x_test_m = df_test_m.drop(['Crime Solved', '_merge'], axis=1).astype(str)
y_test_m = df_test_m['Crime Solved'].astype(str)
x_test_all = df_logit.drop('Crime Solved', axis=1).astype(str)

# prepare input data
X_train_enc, X_test_enc = prepare_inputs(x_m, x_test_m, x_test_all)
# prepare output data
y_train_enc, y_test_enc = prepare_targets(y_m, y_test_m)
# feature selection
X_train_fs, X_test_fs, fs = select_features(X_train_enc, y_train_enc, X_test_enc)

features = pd.DataFrame()
features['Attributes'] = x_m.columns
features['Feature Selection'] = fs.scores_.round(2)
features.sort_values(by='Feature Selection', ascending=False).head(5)

# what are scores for the features
fig = px.bar(features, x='Attributes', y='Feature Selection')
fig.show()
In [73]:
# Take the top 6 scored attributes: 'Agency Code' was dropped from the
# dummy-coded data earlier, so six prefixes are needed to keep five usable features.
top_5 = features.sort_values(by='Feature Selection', ascending=False).head(6)['Attributes'].values
filter_col = [col for col in x_train.columns if col.startswith(tuple(top_5))]
x_train2 = x_train[filter_col]

# from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression()
rfe = RFE(logreg, n_features_to_select=5)
rfe = rfe.fit(x_train2, y_train.values.ravel())

rfe_features = pd.DataFrame()
rfe_features['Attributes'] = x_train2.columns
rfe_features['Significant'] = rfe.support_
rfe_features.loc[rfe_features['Significant'] == True]

rfe_top_5 = rfe_features.loc[rfe_features['Significant'] == True]['Attributes'].values

x_train3 = x_train2[rfe_top_5]

rfe_features.loc[rfe_features['Significant'] == True]['Attributes'].values
Out[73]:
array(['Perpetrator Sex_Female', 'Perpetrator Sex_Male',
       'Perpetrator Sex_Unknown', 'Perpetrator Age_0',
       'Relationship_Unknown'], dtype=object)

9.4 Logistic Regression

For the final logistic regression, the model is fed only the attributes chosen during feature selection, in an effort to reduce run time (Resource). After confirming the p-values are below 0.05, the attributes are passed to logistic regression and evaluated on the test set.

In [74]:
logit_model=sm.Logit(y_train,x_train3[['Perpetrator Sex_Male',
       'Perpetrator Sex_Unknown', 'Perpetrator Age_0',
       'Relationship_Unknown']])
result=logit_model.fit()
print(result.summary2())

x_train4 = x_train3[['Perpetrator Sex_Male',
       'Perpetrator Sex_Unknown', 'Perpetrator Age_0',
       'Relationship_Unknown']]

logreg = LogisticRegression()
logreg.fit(x_train4, y_train)

x_test2 = x_test[['Perpetrator Sex_Male',
       'Perpetrator Sex_Unknown', 'Perpetrator Age_0',
       'Relationship_Unknown']]  # column order must match x_train4

# print the accuracy of logit
y_pred = logreg.predict(x_test2)
print('Accuracy of logistic regression classifier on test set: {:.2f}'.format(logreg.score(x_test2, y_test)))
Optimization terminated successfully.
         Current function value: 0.037777
         Iterations 12
                              Results: Logit
===========================================================================
Model:                   Logit               Pseudo R-squared:    0.945    
Dependent Variable:      Crime Solved        AIC:                 5773.6066
Date:                    2020-09-13 01:57    BIC:                 5810.5769
No. Observations:        76312               Log-Likelihood:      -2882.8  
Df Model:                3                   LL-Null:             -52895.  
Df Residuals:            76308               LLR p-value:         0.0000   
Converged:               1.0000              Scale:               1.0000   
No. Iterations:          12.0000                                           
---------------------------------------------------------------------------
                         Coef.   Std.Err.    z     P>|z|   [0.025   0.975] 
---------------------------------------------------------------------------
Perpetrator Sex_Male      7.6710   0.2888  26.5658 0.0000   7.1051   8.2370
Perpetrator Sex_Unknown -13.4100   0.5515 -24.3158 0.0000 -14.4910 -12.3291
Perpetrator Age_0         2.1724   0.3985   5.4518 0.0000   1.3914   2.9534
Relationship_Unknown      3.5272   0.2928  12.0449 0.0000   2.9533   4.1012
===========================================================================

Accuracy of logistic regression classifier on test set: 1.00

While an accuracy of 1.00 initially raises doubt, the confusion matrix confirms that misclassifications are genuinely rare: only a few hundred of the roughly 562,000 test records are mislabeled. This near-perfect performance suggests the selected features effectively encode the outcome, since an unsolved case typically has no identified perpetrator (hence "Unknown" perpetrator attributes).

In [75]:
cm = confusion_matrix(y_test, y_pred)  # avoid shadowing the imported function
print(cm)

plot_confusion_matrix(logreg, x_test2, y_test)
plt.show()
[[152074     52]
 [   131 409885]]
In [76]:
print(classification_report(y_test, y_pred))
              precision    recall  f1-score   support

           0       1.00      1.00      1.00    152126
           1       1.00      1.00      1.00    410016

    accuracy                           1.00    562142
   macro avg       1.00      1.00      1.00    562142
weighted avg       1.00      1.00      1.00    562142

The ROC curve supports the confusion matrix, showing an AUC of 1. This result is somewhat misleading: a single variable is essentially sufficient to determine whether a case will be solved, again pointing to the "Unknown" indicators tracking the outcome itself.
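One way to probe a suspiciously perfect score is a leakage check: if a feature is (near) deterministic given the label, perfect accuracy is expected rather than impressive. A toy sketch of the idea (hypothetical counts, not the real data):

```python
import pandas as pd

# Hypothetical: unsolved cases almost always have an unknown perpetrator sex
toy = pd.DataFrame({
    'Perpetrator Sex_Unknown': [1, 1, 1, 0, 0, 0, 0, 0],
    'Crime Solved':            [0, 0, 0, 1, 1, 1, 1, 0],
})

# Row-normalized: P(solved | feature value); a row of ~0/1 signals leakage
ct = pd.crosstab(toy['Perpetrator Sex_Unknown'], toy['Crime Solved'], normalize='index')
print(ct)
```

Running the same crosstab on the actual dummy-coded columns would show whether the selected predictors are recorded only after a case is solved, in which case they cannot be used for prospective prediction.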

In [77]:
logit_roc_auc = roc_auc_score(y_test, logreg.predict(x_test2))
fpr, tpr, thresholds = roc_curve(y_test, logreg.predict_proba(x_test2)[:,1])
plt.figure()
plt.plot(fpr, tpr, label='Logistic Regression (area = %0.2f)' % logit_roc_auc)
plt.plot([0, 1], [0, 1],'r--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver operating characteristic')
plt.legend(loc="lower right")
plt.savefig('Log_ROC')
plt.show()

10. Works Cited